Tolerating SEU Faults in the Raw Architecture

Authors

  • Karandeep Singh
  • Adnan Agbaria
  • Dong-In Kang
  • Matthew French
Abstract

This paper describes software fault tolerance techniques for mitigating single-event upset (SEU) faults in the Raw architecture, a single-chip, parallel, tiled computing architecture. The techniques are efficient checkpointing and rollback of processor state, breakpointing, selective replication of code, and selective duplication of tiles. They can be implemented entirely in software, require no changes to the architecture, are transparent to the user, and are designed to meet the system's run-time performance and throughput requirements. We illustrate the techniques by applying them to a matrix multiply kernel mapped onto Raw. The proposed techniques also apply to other tiled architectures and to parallel systems in general.
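The checkpoint-and-rollback idea combined with replicated execution can be pictured with a small software emulation. The sketch below is not the authors' implementation and does not run on Raw tiles. It assumes a row-by-row matrix multiply in which each row is computed twice, a checkpoint is committed only when both results agree, and a mismatch triggers a rollback and retry. The names `fault_tolerant_matmul`, `make_faulty`, and the `compute_row` hook are hypothetical and exist only for illustration.

```python
def matmul_row(a_row, b):
    """Compute one row of the product A x B."""
    cols = len(b[0])
    return [sum(a_row[k] * b[k][j] for k in range(len(b))) for j in range(cols)]


def fault_tolerant_matmul(a, b, compute_row=matmul_row, max_retries=3):
    """Multiply a by b, committing a checkpoint after every verified row.

    compute_row is a hypothetical hook so that a test can inject faults.
    Returns the product and the number of rollbacks that were needed.
    """
    checkpoint = []   # last known-good state: the rows verified so far
    rollbacks = 0
    for a_row in a:
        for _ in range(max_retries):
            state = list(checkpoint)         # working state restored from checkpoint
            first = compute_row(a_row, b)
            second = compute_row(a_row, b)   # replicated execution of the same work
            if first == second:
                state.append(first)
                checkpoint = state           # results agree: commit new checkpoint
                break
            rollbacks += 1                   # mismatch: drop working state and retry
        else:
            raise RuntimeError("row failed verification after retries")
    return checkpoint, rollbacks


def make_faulty(flip_on_call):
    """Return a compute_row that flips one bit of one result on a chosen call."""
    calls = {"n": 0}

    def compute(a_row, b):
        calls["n"] += 1
        row = matmul_row(a_row, b)
        if calls["n"] == flip_on_call:
            row[0] ^= 1   # emulate a single-bit upset in one integer result
        return row

    return compute


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    product, rollbacks = fault_tolerant_matmul(a, b, compute_row=make_faulty(2))
    print(product, rollbacks)
```

Comparing two executions detects a transient fault that corrupts only one of them. Identical corruption of both copies would go unnoticed, which is one reason a real system might replicate across separate tiles rather than re-execute on the same one.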

Similar resources

Tolerating Faults in a Mesh with a Row of Spare Nodes

Bruck, J., R. Cypher and C.-T. Ho, Tolerating faults in a mesh with a row of spare nodes, Theoretical Computer Science 128 (1994) 241-252. We present an efficient method for tolerating faults in a two-dimensional mesh architecture. Our approach is based on adding spare components (nodes) and extra links (edges) such that the resulting architecture can be reconfigured as a mesh in the presence of...

Axo: Tolerating Delay Faults in Real-Time Systems

We address delay faults: faults that cause a software component to take longer to complete an action than a given deadline allows. Such faults are of particular interest in real-time mission-critical control applications that use general-purpose computing platforms to compute setpoints. A violation of the real-time constraints associated with setpoints can result in failure. Existing benign and Byza...

An Adaptive Algorithm for Tolerating Value Faults and Crash Failures

The AQuA architecture provides adaptive fault tolerance to CORBA applications by replicating objects and providing a high-level method that an application can use to specify its desired level of dependability. This paper presents the algorithms that AQuA uses, when an application’s dependability requirements can change at runtime, to tolerate both value faults in applications and crash failures...

Bridging the Gap between Hardware and Software Fault Tolerance

During the last decades several mechanisms for tolerating errors caused by software (design) faults have been put forward. Unfortunately only few experimental programming languages have incorporated them, so these schemes are not available in programming languages and systems that are used in developing modern applications. This is why programmers must either implement these mechanisms themselv...

Fault-Tolerant Meshes and Hypercubes with Minimal Numbers of Spares

Many parallel computers consist of processors connected in the form of a d-dimensional mesh or hypercube. Two- and three-dimensional meshes have been shown to be efficient in manipulating images and dense matrices, whereas hypercubes have been shown to be well suited to divide-and-conquer algorithms requiring global communication. However, even a single faulty processor or communication link can s...

Publication date: 2006